Abstract

Secure similar document detection (SSDD) identifies similar documents of two parties without either party disclosing its own sensitive documents to the other. In this paper, we propose an efficient 2-step protocol that exploits feature selection as a lower-dimensional transformation, and we present discriminative feature selections that maximize the performance of the protocol. We first show that the existing 1-step protocol incurs serious computation and communication overhead for high-dimensional document vectors. To alleviate the overhead, we then present the feature selection-based 2-step protocol and formally prove its correctness. The proposed 2-step protocol works as follows: (1) in the filtering step, it uses low-dimensional vectors obtained by the feature selection to filter out non-similar documents; (2) in the post-processing step, it identifies similar documents only among the non-filtered documents by using the 1-step protocol. As the feature selection, we first consider the simplest one, random projection (RP), and propose its 2-step solution SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF (local frequency), which selects a few dimensions that are locally frequent in the current querying vector, and SSDD-GF (global frequency), which selects dimensions that are globally frequent in the set of all document vectors. We finally propose a hybrid one, SSDD-HF (hybrid frequency), that takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed 2-step protocol outperforms the 1-step protocol by three or four orders of magnitude.

Keywords: secure similar document detection, cosine similarity, feature selection, lower-dimensional 
transformation, term frequency, document frequency 


1 Introduction 


Similar document detection is the problem of finding similar documents of two parties, Alice and Bob, and it has been widely used in version management of files, copyright protection, and plagiarism detection [24, 25]. Recently, secure similar document detection (SSDD) [15] has been introduced to identify similar documents while preserving the privacy of each party’s documents, as shown in Figure 1. That is, SSDD finds similar document pairs whose cosine similarity [13, 26] exceeds the given tolerance without disclosing the document vectors to the other party. SSDD is a typical example of privacy-preserving data mining (PPDM) [2, 16] and has the following applications [15]. First, for two or more conferences that do not allow double submissions, SSDD finds the double-submitted papers without disclosing each conference’s papers to the others. Second, in insurance fraud detection, SSDD searches for similar accident cases across two or more insurance companies without revealing sensitive and private cases to the other companies.



Figure 1: Concept of secure similar document detection. 


Jiang et al. [15] have proposed a novel solution for SSDD by exploiting secure multiparty computation (SMC) [9, 22] in a semi-honest model. Their solution preserves the privacy of the two parties by using a secure scalar product to compute the cosine similarity between document vectors. As the secure scalar product, they suggest the random matrix and homomorphic encryption methods [12, 28]. In this paper, we use the random matrix method as the base protocol, and we call it SSDD-Base. However, SSDD-Base incurs severe computation and communication overhead. Let Alice’s and Bob’s document sets be $U$ and $V$, respectively; then SSDD-Base requires $|U||V|$ secure scalar products. In many cases, the dimension $n$ of document vectors reaches tens of thousands or even hundreds of thousands, so SSDD-Base has a very high complexity of $O(n|U||V|)$, which is impractical for large document databases. In particular, if there are many parties or the document databases change frequently, the overhead becomes even more critical.












To alleviate the computation and communication overhead of SSDD-Base, in this paper we present a 2-step protocol that exploits feature selection as a lower-dimensional transformation. Feature selection transforms high-dimensional document vectors into low-dimensional feature vectors; in general, it selects tens to hundreds of dimensions out of thousands to tens of thousands of dimensions. We call the feature selection FS for short. Representative FS methods include RP (random projection), DF (document frequency) [27], and LDA (linear discriminant analysis). In this paper, we use RP and DF since they are known as simple but efficient feature selections [27]. To devise a 2-step protocol, we need an upper bound of cosine similarity for the filtering process. Thus, we first present an upper bound of FS and formally prove its correctness. Using the upper bound property of FS, we then propose a generic 2-step protocol, called SSDD-FS. The proposed SSDD-FS works as follows: in the first filtering step, it converts n-dimensional vectors to f (< n)-dimensional vectors and applies the secure protocol to the f-dimensional vectors to filter out non-similar n-dimensional vectors; in the second post-processing step, it applies the base protocol SSDD-Base to the non-filtered n-dimensional vectors. In the filtering step, SSDD-FS prunes many non-similar high-dimensional vectors by comparing low-dimensional vectors with the relatively low complexity of $O(f|U||V|)$, and thus it significantly improves the performance compared with SSDD-Base.

To make SSDD-FS efficient, FS should be highly discriminative, i.e., FS should filter out as many non-similar high-dimensional vectors as possible. In this paper, we analyze SSDD protocols in detail and propose four different techniques as discriminative implementations of FS. RP is the easiest way of implementing FS: it randomly selects f dimensions from n dimensions. RP is easy, but its filtering effect will be very low due to the randomness. To solve this problem of RP, we exploit DF, which selects feature dimensions based on frequencies in all document vectors. In particular, building on the concept of DF, we present three variants, called LF (local frequency), GF (global frequency), and HF (hybrid frequency). First, LF considers the term frequencies of Alice’s current querying vector (we call it the current vector), and it selects the dimensions whose frequencies are higher than the others in the current vector. LF focuses on locality, meaning that considering only the current vector might be enough to decrease the upper bound of cosine similarity. Second, GF is DF itself: GF counts the number of documents containing each term (dimension), constructs a frequency vector from those counts (we call it the whole vector), and selects high-frequency dimensions from the whole vector. GF focuses on globality since it considers all the document vectors. To implement GF, however, we need a secure protocol for obtaining the whole vector from both Alice’s and Bob’s document sets. For this, we propose a secure protocol, SecureDF, as a secure implementation of DF. Third, HF takes advantage of both the locality of LF and the globality of GF. HF computes a difference vector between the current and whole vectors and selects high-valued dimensions from the difference vector. This is because HF tries to maximize the value difference between Alice’s and Bob’s vectors for each selected dimension and thereby decrease the upper bound of cosine similarity. Table 1 summarizes these four feature selections and their corresponding SSDD protocols, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF, to be proposed in Section 4.

Table 1: Feature selection methods to be used for SSDD-FS.

Method     Description
SSDD-RP    Randomly select f dimensions from an n-dimensional vector.
SSDD-LF    Select highly frequent f dimensions from Alice’s n-dimensional current vector.
SSDD-GF    Select highly frequent f dimensions from the n-dimensional whole vector.
SSDD-HF    Select high-valued f dimensions from the n-dimensional difference vector between current and whole vectors.

In this paper, we empirically evaluate the base protocol, SSDD-Base, and our four SSDD-FS protocols (SSDD-RP, SSDD-LF, SSDD-GF, SSDD-HF) using various data sets. Experimental results show that the SSDD-FS protocols significantly outperform SSDD-Base. This means that the proposed 2-step protocols effectively prune a large number of non-similar document pairs early in the filtering step. In particular, SSDD-HF, which takes advantage of both the locality of SSDD-LF and the globality of SSDD-GF, shows the best performance. Compared with SSDD-Base, SSDD-HF reduces the execution time of SSDD by three or four orders of magnitude.

The rest of this paper is organized as follows. Section 2 explains related work and the background of the research. Section 3 presents the FS-based 2-step protocol, SSDD-FS, and proves its correctness. Section 4 introduces four feature selections, RP, LF, GF, and HF, and proposes their corresponding secure protocols. Section 5 explains experimental results on various data sets. We finally summarize and conclude the paper in Section 6.









2 Related Work and Background 


We use cosine similarity as the basic operation of similar document detection. The cosine similarity of two n-dimensional vectors $\vec{U} = \{u_1, \ldots, u_n\}$ and $\vec{V} = \{v_1, \ldots, v_n\}$ is computed as $\cos(\vec{U}, \vec{V}) = \frac{\vec{U} \cdot \vec{V}}{\|\vec{U}\| \, \|\vec{V}\|}$, where $\vec{U} \cdot \vec{V}$ is the scalar product of $\vec{U}$ and $\vec{V}$, that is, $\sum_{i=1}^{n} u_i v_i$. If the two parties can compute $\vec{U} \cdot \vec{V}$ securely, they can also compute $\cos(\vec{U}, \vec{V})$ securely. There are two representative methods for the secure scalar product [13]. The first one is the random matrix method [28], where the two parties share the same random matrix and compute the scalar product securely using the matrix. The second one is the homomorphic encryption method [12], where the two parties use a homomorphic probabilistic public-key system for the secure computation of scalar products. In this paper, we use the random matrix method since it is more efficient than the homomorphic encryption method, but the homomorphic encryption method can be used instead in the protocols discussed later. Without loss of generality, we assume that vectors $\vec{U}$ and $\vec{V}$ are normalized to length 1. That is, $\|\vec{U}\| = \|\vec{V}\| = 1$, and thus simply $\cos(\vec{U}, \vec{V}) = \vec{U} \cdot \vec{V}$.


Figure 2 shows the protocol of SSDD-Base, the recent solution of SSDD by Jiang et al. [15]. SSDD-Base uses the random matrix method [28] for secure scalar products, where Alice and Bob share the same matrix $A$ and securely determine whether two vectors $\vec{U}$ and $\vec{V}$ are similar or not. For the correctness and a detailed explanation of Protocol SSDD-Base, readers are referred to [15]. In SSDD, we perform SSDD-Base for each pair of document vectors. More formally, if $U$ and $V$ are the sets of document vectors owned by Alice and Bob, respectively, we perform SSDD-Base for each pair $(\vec{U}, \vec{V})$, where $\vec{U} \in U$ and $\vec{V} \in V$. As we mentioned in Section 1, however, SSDD-Base incurs the severe computation and communication overhead of $O(n|U||V|)$, which becomes much more serious if there are several parties or a large number of documents change dynamically. To alleviate this critical overhead, in this paper we discuss a 2-step solution for SSDD.
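The masking idea behind the random matrix method can be illustrated with a minimal single-process sketch of Lines 1 to 8 of Protocol SSDD-Base. This is only a plain simulation under stated assumptions: the function name `secure_scalar_product`, the use of Python's `random` module, and the value ranges are hypothetical choices for illustration, and no network transfer or cryptographic-quality randomness is modeled.

```python
import random

def secure_scalar_product(u, v, A, rng):
    """Simulate Lines 1-8 of Protocol SSDD-Base in one process (illustrative only)."""
    n, half = len(u), len(A[0])
    # Alice, Lines 1-3: draw n/2 random numbers R and mask U as Z = U + A R.
    r = [rng.uniform(-1.0, 1.0) for _ in range(half)]
    z = [u[i] + sum(A[i][j] * r[j] for j in range(half)) for i in range(n)]
    # Bob, Lines 4-6: compute S = Z . V and Y = A^T V, then return both.
    s = sum(z[i] * v[i] for i in range(n))
    y = [sum(A[i][j] * v[i] for i in range(n)) for j in range(half)]
    # Alice, Lines 7-8: S' = Y . R cancels the mask, leaving U . V.
    s_prime = sum(y[j] * r[j] for j in range(half))
    return s - s_prime

rng = random.Random(7)            # fixed seed only to make the sketch repeatable
n = 6
A = [[rng.uniform(-1.0, 1.0) for _ in range(n // 2)] for _ in range(n)]
u = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]   # unit-length vector of Alice
v = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5]   # unit-length vector of Bob
delta = secure_scalar_product(u, v, A, rng)
print(round(delta, 9))
```

Because $S = (\vec{U} + A\vec{R}) \cdot \vec{V}$ and $S' = (A^{T}\vec{V}) \cdot \vec{R}$, the difference $S - S'$ equals $\vec{U} \cdot \vec{V}$ up to floating-point rounding.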


Protocol SSDD-Base

(1) $\vec{U}$ and $\vec{V}$ are n-dimensional vectors owned by Alice and Bob, respectively.
(2) Assume that Alice and Bob maintain the same $n \times n/2$ matrix $A = [a_{i,j}]$.

Alice:
  1. Generate n/2 random numbers $r_1, r_2, \ldots, r_{n/2}$;
  2. Compute an n-dimensional vector $\vec{Z}$, where $z_i = u_i + a_{i,1} \cdot r_1 + \cdots + a_{i,n/2} \cdot r_{n/2}$;
  3. Send $\vec{Z}$ to Bob;
Bob:
  4. Compute a scalar value $S = \vec{Z} \cdot \vec{V}$;
  5. Compute an n/2-dimensional vector $\vec{Y}$, where $y_i = a_{1,i} \cdot v_1 + \cdots + a_{n,i} \cdot v_n$;
  6. Send $\vec{Y}$ with $S$ to Alice;
Alice:
  7. Compute a scalar value $S' = \vec{Y} \cdot \vec{R}$;  // $\vec{R} = \{r_1, \ldots, r_{n/2}\}$
  8. Obtain $\delta = S - S'$;  // Note that $\delta = \vec{U} \cdot \vec{V}$
  9. if $\delta \geq \epsilon$ then identify $\vec{U}$ and $\vec{V}$ as a similar pair;  // $\epsilon$ = user-specified tolerance

Figure 2: Protocol of SSDD-Base.

In text mining and time-series mining, many lower-dimensional transformations have been proposed to solve the dimensionality curse problem [20] of high-dimensional vectors. We can classify lower-dimensional transformations into feature extractions and feature selections [23, 27]. First, feature extraction creates a few new features from an original high-dimensional vector. Representative examples of feature extractions include LSI (latent semantic indexing) [10, 30], LPI (locality preserving indexing) [6], DFT (discrete Fourier transform), DWT (discrete wavelet transform), and PAA (piecewise aggregate approximation) [14, 31]. In contrast, feature selection selects a few discriminative features from an original (or transformed) high-dimensional vector. Representative examples of feature selections include RP, DF, LDA, and PCA (principal component analysis) [27]. In this paper, we use RP and DF with appropriate variations. This is because RP and DF are much simpler than the other transformations, and accordingly they are easily applied to SSDD with low complexity; on the other hand, LSI, LPI, LDA, and PCA may provide very accurate feature vectors, but they are too complex to be applied to SSDD. For a detailed explanation of lower-dimensional transformations for text mining, readers are referred to [23, 27, 30].

There have been many efforts on PPDM. PPDM solutions can be classified into four categories: data perturbation, k-anonymization, distributed privacy preservation, and privacy preservation of mining results [19]. SSDD can be regarded as an application of distributed privacy preservation. For a detailed explanation of the problems and solutions of data perturbation and k-anonymization, readers are referred to the survey papers on PPDM.



3 Feature Selection-based Secure 2-Step Protocol 


In this paper, we use FS (feature selection) for the secure 2-step protocol. To transform an n-dimensional vector to an f-dimensional vector, FS chooses f dimensions from n dimensions, either randomly or by high frequency, and thus its transformation process is very simple. In this section, we first assume that FS can select f dimensions from n dimensions in a secure manner, and we then propose the secure 2-step protocol of SSDD by using the secure FS.


To use a lower-dimensional transformation $F$ for SSDD, we need to find an upper bound function $upper(\vec{U}^{F}, \vec{V}^{F})$ that satisfies Eq. (1), where $\vec{U}^{F}$ and $\vec{V}^{F}$ are f-dimensional feature vectors selected from n-dimensional vectors $\vec{U}$ and $\vec{V}$, respectively, by the transformation $F$. In Eq. (1), $\vec{U} = \{u_1, \ldots, u_n\}$, $\vec{V} = \{v_1, \ldots, v_n\}$, $\vec{U}^{F} = \{u_1^{F}, \ldots, u_f^{F}\}$, and $\vec{V}^{F} = \{v_1^{F}, \ldots, v_f^{F}\}$.

$$\cos(\vec{U}, \vec{V}) \leq upper(\vec{U}^{F}, \vec{V}^{F}) \tag{1}$$

The transformation $F$ should satisfy Eq. (1) because SSDD using $F$ should not incur any false dismissal; this requirement corresponds to Parseval’s theorem (the lower bound property of Euclidean distances) in time-series matching [18, 20]. To obtain an upper bound of the lower-dimensional transformation $F$, we first define an upper bound of $F$ as follows.


Definition 1 If a lower-dimensional transformation $F$ transforms n-dimensional vectors $\vec{U}$ and $\vec{V}$ to f-dimensional vectors $\vec{U}^{F}$ and $\vec{V}^{F}$, respectively, we define an upper bound function of $F$, denoted by $upper(\vec{U}^{F}, \vec{V}^{F})$, as Eq. (2).

$$upper(\vec{U}^{F}, \vec{V}^{F}) = 1 - \frac{D^2(\vec{U}^{F}, \vec{V}^{F})}{2} \tag{2}$$

where $D^2(\vec{U}^{F}, \vec{V}^{F})$ is the squared Euclidean distance between $\vec{U}^{F}$ and $\vec{V}^{F}$, i.e., $D^2(\vec{U}^{F}, \vec{V}^{F}) = \sum_{i=1}^{f} (u_i^{F} - v_i^{F})^2$. □


In this paper, we want to use FS as the lower-dimensional transformation $F$, and thus we formally prove that the upper bound function of FS satisfies Eq. (1), the upper bound property of cosine similarity.



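Theorem 1 If a feature selection FS transforms n-dimensional vectors $\vec{U}$ and $\vec{V}$ to f-dimensional vectors $\vec{U}^{FS}$ and $\vec{V}^{FS}$, respectively, $upper(\vec{U}^{FS}, \vec{V}^{FS})$ is an upper bound of $\cos(\vec{U}, \vec{V})$, that is, Eq. (3) holds.

$$\cos(\vec{U}, \vec{V}) \leq upper(\vec{U}^{FS}, \vec{V}^{FS}) \tag{3}$$

Proof: First, let $\vec{U} = \{u_1, \ldots, u_n\}$, $\vec{V} = \{v_1, \ldots, v_n\}$, $\vec{U}^{FS} = \{u_1^{FS}, \ldots, u_f^{FS}\}$, and $\vec{V}^{FS} = \{v_1^{FS}, \ldots, v_f^{FS}\}$. Then, Eqs. (4) and (5) hold for $\vec{U}$ and $\vec{V}$, which are normalized to length 1.

$$D^2(\vec{U}, \vec{V}) = \sum_{i=1}^{n} (u_i - v_i)^2 = \sum_{i=1}^{n} u_i^2 - 2 \sum_{i=1}^{n} u_i v_i + \sum_{i=1}^{n} v_i^2 = \|\vec{U}\|^2 - 2\, \vec{U} \cdot \vec{V} + \|\vec{V}\|^2 = 2 - 2 \cos(\vec{U}, \vec{V}) \tag{4}$$

$$\cos(\vec{U}, \vec{V}) = 1 - \frac{D^2(\vec{U}, \vec{V})}{2} \tag{5}$$

We note that all entry values of $\vec{U}$ and $\vec{V}$ are non-negative, and FS constructs $\vec{U}^{FS}$ and $\vec{V}^{FS}$ by choosing f features from $\vec{U}$ and $\vec{V}$. Based on this property, Eq. (6) holds, since the right-hand sum contains only a subset of the non-negative terms of the left-hand sum.

$$D^2(\vec{U}, \vec{V}) = \sum_{i=1}^{n} (u_i - v_i)^2 \geq \sum_{i=1}^{f} (u_i^{FS} - v_i^{FS})^2 = D^2(\vec{U}^{FS}, \vec{V}^{FS}) \tag{6}$$

Finally, Eq. (7) holds by Eqs. (5), (6), and Eq. (2) of Definition 1.

$$\cos(\vec{U}, \vec{V}) = 1 - \frac{D^2(\vec{U}, \vec{V})}{2} \leq 1 - \frac{D^2(\vec{U}^{FS}, \vec{V}^{FS})}{2} = upper(\vec{U}^{FS}, \vec{V}^{FS}) \tag{7}$$

Therefore, $upper(\vec{U}^{FS}, \vec{V}^{FS})$ is an upper bound of $\cos(\vec{U}, \vec{V})$.

□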
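The upper bound property can also be checked numerically. The following is a small sanity check rather than a proof, under the assumptions that vectors are non-negative and normalized to length 1 and that the f dimensions are picked at random; the helper names `normalize`, `cosine`, and `upper` are hypothetical and introduced only for this sketch.

```python
import math
import random

def normalize(x):
    # Scale x to unit length, as assumed for all document vectors.
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def cosine(u, v):
    # For unit vectors the cosine similarity is just the scalar product.
    return sum(a * b for a, b in zip(u, v))

def upper(u, v, idx):
    # Upper bound function: 1 - D^2(U^FS, V^FS) / 2 over the selected dimensions.
    d2 = sum((u[i] - v[i]) ** 2 for i in idx)
    return 1.0 - d2 / 2.0

rng = random.Random(0)
n, f = 50, 5
violations = 0
for _ in range(1000):
    u = normalize([rng.random() for _ in range(n)])
    v = normalize([rng.random() for _ in range(n)])
    idx = rng.sample(range(n), f)          # any f of the n dimensions
    if cosine(u, v) > upper(u, v, idx) + 1e-12:
        violations += 1
print("violations:", violations)
```

The tolerance `1e-12` only absorbs floating-point rounding; the bound itself holds for any choice of the f dimensions.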


By using the upper bound property of FS, we now propose a generic 2-step protocol SSDD-FS. Figure 3 shows Protocol SSDD-FS. As shown in the protocol, SSDD-FS maintains f-dimensional $\vec{U}^{FS}$ and $\vec{V}^{FS}$ as well as n-dimensional $\vec{U}$ and $\vec{V}$ of SSDD-Base. Also, Alice and Bob share an $f \times f/2$ matrix $A^{FS}$ as well as the $n \times n/2$ matrix $A$ of SSDD-Base. Lines 1 to 7 of SSDD-FS are the first step, which discards non-similar n-dimensional vectors in the f-dimensional space. First, Lines 1 to 4 securely compute the scalar product $\delta$ of the f-dimensional vectors $\vec{U}^{FS}$ and $\vec{V}^{FS}$. Except for using f-dimensional vectors instead of n-dimensional vectors, these steps are the same as those of SSDD-Base. The only difference from SSDD-Base is that Bob additionally sends $\|\vec{V}^{FS}\|^2$ to Alice in Line 3 for computing $D^2(\vec{U}^{FS}, \vec{V}^{FS})$. In Line 5, Alice computes $\Delta$ $(= D^2(\vec{U}^{FS}, \vec{V}^{FS}))$ by using Eq. (8).

$$\Delta = D^2(\vec{U}^{FS}, \vec{V}^{FS}) = \|\vec{U}^{FS}\|^2 - 2\delta + \|\vec{V}^{FS}\|^2 \tag{8}$$


After that, Alice computes the upper bound function of FS, $upper(\vec{U}^{FS}, \vec{V}^{FS})$, in Line 6. In Line 7, we perform the filtering process by comparing the upper bound $(= \upsilon)$ with the given tolerance $(= \epsilon)$. If the upper bound is less than the tolerance, i.e., if $\upsilon < \epsilon$, the actual cosine similarity will also be less than the tolerance, and we do not need to compute it in the n-dimensional space. That is, if $\upsilon < \epsilon$, we can skip Line 8 of the second step. Thus, Line 8 is executed only if the n-dimensional vectors $(\vec{U}, \vec{V})$ are not filtered out by the upper bound. In Line 8, we compute the actual cosine similarity of $(\vec{U}, \vec{V})$ by using SSDD-Base.
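The control flow of the two steps can be sketched in plain, non-secure Python as follows. This is an illustration only: in the actual protocol the scalar products are obtained through SSDD-Base, whereas here they are computed directly, and the function name `two_step_similar` and its return convention are assumptions made for this sketch.

```python
def two_step_similar(u, v, idx, eps):
    """Return (is_similar, was_filtered) for unit vectors u and v.

    idx holds the f selected dimensions; eps is the tolerance.
    Scalar products are computed in the clear purely for illustration.
    """
    # 1st step: work only on the f selected dimensions.
    u_fs = [u[i] for i in idx]
    v_fs = [v[i] for i in idx]
    delta = sum(a * b for a, b in zip(u_fs, v_fs))
    # Squared distance from the two squared norms and the scalar product.
    big_delta = sum(a * a for a in u_fs) - 2 * delta + sum(b * b for b in v_fs)
    bound = 1.0 - big_delta / 2.0          # upper bound of cos(u, v)
    if bound < eps:
        return False, True                 # discarded without touching n dimensions
    # 2nd step: full n-dimensional comparison for the surviving pair.
    cos = sum(a * b for a, b in zip(u, v))
    return cos >= eps, False

# Orthogonal pair, discriminative dimensions: filtered in the 1st step.
print(two_step_similar([1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 1], 0.8))
# Identical pair: survives the filter and is reported as similar.
print(two_step_similar([0.6, 0.8, 0, 0], [0.6, 0.8, 0, 0], [0, 1], 0.8))
# Orthogonal pair, uninformative dimensions: not filtered, rejected in the 2nd step.
print(two_step_similar([1.0, 0, 0, 0], [0, 1.0, 0, 0], [2, 3], 0.8))
```

The third call shows why the choice of dimensions matters: when the selected dimensions carry no difference between the two vectors, the bound stays at 1 and the pair must still be processed in the n-dimensional space.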

We note that how much SSDD-FS improves the performance over SSDD-Base depends on how many n-dimensional vectors are discarded in the first step. This filtering effect largely depends on the discriminative power of the feature selection, i.e., the efficiency of FS. In other words, if FS achieves a large filtering effect, SSDD-FS can reduce the computation and communication overhead from $O(n|U||V|)$ to $O(f|U||V|)$. Based on this observation, we need to maximize the filtering effect of FS, which is the problem of how to choose f dimensions from n dimensions so as to maximize the discriminative power of FS. Therefore, we propose efficient FS variants and their SSDD protocols in Section 4 and evaluate their performance in Section 5.

4 Discriminative Feature Selections for the 2-Step Protocol 


In this section, we propose four methods to implement FS of Protocol SSDD-FS. Figure 4 shows the procedure of SSDD-FS including the feature selection step. As shown in the figure, we first obtain $\vec{U}^{FS}$ and $\vec{V}^{FS}$ from $\vec{U}$ and $\vec{V}$ through the feature selection, which should also be done securely. As mentioned in Section 1, we present RP, LF, GF, and HF as the feature selection methods, and we explain how they work in detail in Sections 4.1 to 4.4. In Figure 4, the secure feature selection corresponds to Line (1) of Protocol SSDD-FS, and the other two steps correspond to the first step (Lines 1 to 7) and the second step (Line 8), respectively.

Protocol SSDD-FS

(1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors; $\vec{U}^{FS}$ and $\vec{V}^{FS}$ are their f-dimensional feature vectors.
(2) Assume that Alice and Bob share the matrices $A$ and $A^{FS}$ of sizes $n \times n/2$ and $f \times f/2$, respectively.

[1st step: the filtering step in the f-dimensional space]
Alice:
  1. Execute Lines 1 to 3 of SSDD-Base for $\vec{U}^{FS}$ and $A^{FS}$ instead of $\vec{U}$ and $A$;
Bob:
  2. Execute Lines 4 to 6 of SSDD-Base for $\vec{V}^{FS}$ and $A^{FS}$ instead of $\vec{V}$ and $A$;
  3. Compute $\|\vec{V}^{FS}\|^2$ and send it to Alice;
Alice:
  4. Execute Lines 7 and 8 of SSDD-Base;  // $\delta = \vec{U}^{FS} \cdot \vec{V}^{FS}$
  5. Compute $\|\vec{U}^{FS}\|^2$ and $\Delta = \|\vec{U}^{FS}\|^2 - 2\delta + \|\vec{V}^{FS}\|^2$;  // $\Delta = D^2(\vec{U}^{FS}, \vec{V}^{FS})$
  6. Compute $\upsilon = 1 - \Delta/2$;  // $\upsilon = upper(\vec{U}^{FS}, \vec{V}^{FS})$
  7. if $\upsilon < \epsilon$ then discard the pair $(\vec{U}, \vec{V})$ as a non-similar one;  // $\upsilon$ = upper bound

[2nd step: the post-processing step in the n-dimensional space]
Alice and Bob:
  8. Execute Lines 1 to 9 of SSDD-Base if $(\vec{U}, \vec{V})$ is not discarded in the 1st step;

Figure 3: Protocol of the generic 2-step solution SSDD-FS.


4.1 RP: Random Projection 

RP is the easiest way of implementing FS; it selects f dimensions randomly from n dimensions. We can think of two different methods of applying RP to SSDD-FS. The first one selects f dimensions dynamically for each document pair $(\vec{U}, \vec{V})$; the second one first determines f dimensions and then uses those predetermined dimensions for all document pairs.

Figure 4: Feature selections in the process of SSDD-FS.

To use the first RP method, Alice and Bob should share f indexes, $i_1, \ldots, i_f$ $(1 \leq i_j \leq n,\ j = 1, \ldots, f)$, of randomly selected f dimensions for each $(\vec{U}, \vec{V})$ before starting the first step of SSDD-FS. This sharing can be implemented by having Alice randomly select f dimensions and send their indexes to Bob, or by having Alice and Bob share the same seed of the random function. That is, we can implement the first RP method by modifying Line (1) of Protocol SSDD-FS into Lines (1-1) to (1-3) of Figure 5.

(1-1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors.
(1-2) Alice randomly chooses f dimensions, $i_1, \ldots, i_f$ $(1 \leq i_j \leq n)$, and sends them to Bob.
(1-3) $\vec{U}^{FS}$ and $\vec{V}^{FS}$ are f-dimensional feature vectors extracted from $\vec{U}$ and $\vec{V}$ by using those f indexes.  // $u_j^{FS} = u_{i_j},\ v_j^{FS} = v_{i_j},\ j = 1, \ldots, f$

Figure 5: Modification of Line (1) of SSDD-FS to implement SSDD-RP.


The second RP method uses the same f dimensions for all $(\vec{U}, \vec{V})$ pairs. We can easily implement this method by having Alice and Bob share the same f indexes only once before starting SSDD-FS. Neither the first nor the second RP method discloses any values of Alice’s or Bob’s document vectors, and thus both are said to be secure. Also, the two methods have the same effect in that they select f dimensions randomly. Thus, we use the second one since it is much simpler than the first one, and we call it SSDD-RP to differentiate it from the generic SSDD-FS.
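The second RP method, where both parties derive the same f indexes once from a shared seed, might look as follows. This is a sketch under assumptions: the helper names `rp_indexes` and `project` are hypothetical, indexes are 0-based here (the text uses 1-based indexes), and Python's `random` module is used only for illustration and is not a cryptographically secure generator.

```python
import random

def rp_indexes(n, f, seed):
    # Both parties call this with the same (n, f, seed) and obtain the
    # same f distinct indexes without exchanging any vector values.
    return sorted(random.Random(seed).sample(range(n), f))

def project(vec, idx):
    # Feature vector: keep only the selected dimensions, in index order.
    return [vec[i] for i in idx]

idx_alice = rp_indexes(1000, 10, seed=2024)   # computed on Alice's side
idx_bob = rp_indexes(1000, 10, seed=2024)     # computed on Bob's side
print(idx_alice == idx_bob)
print(project([10, 20, 30, 40], [1, 3]))
```

Sharing a seed avoids sending the index list at all; sending the list explicitly, as in Line (1-2) of Figure 5, is the alternative described in the text.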


4.2 LF: Local Frequency 

SSDD-RP proposed in Section 4.1 achieves only a small filtering effect in the first filtering step. This low filtering effect arises because RP chooses features without any consideration of the characteristics of document vectors. In our experiments, SSDD-RP shows very little improvement in SSDD performance compared with SSDD-Base. To solve this problem of SSDD-RP and to enlarge the filtering effect, in this paper we consider how frequent each term is in the document or document set, i.e., we use the term frequency (TF)¹. In general, the TF concept is used as follows: we first compute the number of occurrences (i.e., the frequency) of each term throughout the whole data set and then choose the highly frequent dimensions. We call this selection method DF (document frequency) as in [27]. The reason why we consider TF (or DF) in SSDD-FS is that, if we select the highly frequent f dimensions, we can obtain relatively small upper bounds $upper(\vec{U}^{FS}, \vec{V}^{FS})$ through relatively large values of $D^2(\vec{U}^{FS}, \vec{V}^{FS})$ in Eq. (2), and accordingly we can achieve a large filtering effect.


As a feature selection using term frequencies, we first consider how frequent each term is in an individual document rather than in the whole document set; that is, we first propose a feature selection that exploits the locality of each document. More precisely, for a pair of documents $(\vec{U}, \vec{V})$, the locality-based selection chooses the f dimensions that are highly frequent in Alice’s current vector $\vec{U}$. This selection is based on the simple intuition that, even without considering the whole vectors of the document set, the current vector itself will have a large influence on the upper bound $upper(\vec{U}^{FS}, \vec{V}^{FS})$. In this selection, we can instead use Bob’s vector $\vec{V}$ rather than Alice’s vector $\vec{U}$ as the current vector, or we can use both Alice’s and Bob’s vectors $\vec{U}$ and $\vec{V}$. Using $\vec{V}$, however, incurs additional communication overhead, and thus in this paper we consider the simple method of using Alice’s $\vec{U}$ as the current vector. We call this selection method LF (local frequency) since it considers individual (i.e., local) documents rather than the whole document set, and we denote the protocol that applies LF to SSDD-FS as SSDD-LF.


SSDD-LF exploits locality by selecting f dimensions anew for each of Alice’s documents before that document is compared. Figure 6 shows how we implement SSDD-LF by modifying Line (1) of SSDD-FS of Figure 3. In Line (1-2), Alice first selects the top f frequent dimensions from her current vector $\vec{U}$. She sends the indexes of the selected f dimensions to Bob in Line (1-3). Thus, they share the same indexes and obtain f-dimensional feature vectors by using the same f indexes in Line (1-4).

(1-1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors.
(1-2) Alice chooses f dimensions, $i_1, \ldots, i_f$ $(1 \leq i_j \leq n)$, whose TFs are larger than those of the other dimensions.
(1-3) Alice sends those f indexes to Bob.  // This can be done together with Line 3 of Figure 2
(1-4) $\vec{U}^{FS}$ and $\vec{V}^{FS}$ are f-dimensional feature vectors extracted from $\vec{U}$ and $\vec{V}$ by using those f indexes.  // $u_j^{FS} = u_{i_j},\ v_j^{FS} = v_{i_j},\ j = 1, \ldots, f$

Figure 6: Modification of Line (1) of SSDD-FS to implement SSDD-LF.

We now analyze the computation and communication overhead of feature selection in SSDD-LF. As shown in Figure 6, for each vector $\vec{U}$, Alice (1) chooses the top f frequent dimensions from the n dimensions of $\vec{U}$ and (2) communicates with Bob to share those f indexes. First, Alice needs the additional computation overhead of $O(n \log f)$ to select the top f frequent dimensions from the current n-dimensional vector. Second, Alice and Bob need additional communication to share the f indexes. However, this communication can be done together with Line 3 of SSDD-Base of Figure 2; that is, Alice can send the f indexes together with the encrypted vector $\vec{Z}$ to Bob. The f indexes are much smaller than the n-dimensional vector, so their overhead is negligible. Thus, we can say that SSDD-LF causes the computation overhead of $O(n \log f)$, but the communication overhead can be ignored. In particular, we compare each vector $\vec{U}$ of Alice with a large number of vectors $(\in V)$ of Bob, so the selection is performed once per vector of Alice and its $O(n \log f)$ computation overhead can also be ignored as a pre-processing step.

¹ In this paper, we use TF for simplicity, but we can also use TF-IDF (term frequency-inverse document frequency) instead of TF. Which frequency to use among TF, TF-IDF, and other feature frequencies is orthogonal to our approach, and we use TF for easy understanding of the proposed concept.
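The top-f selection of SSDD-LF and its $O(n \log f)$ cost can be sketched with a bounded heap. This is a minimal illustration under assumptions: the function name `lf_indexes` is hypothetical, indexes are 0-based, and ties between equal term frequencies are broken by whatever order `heapq.nlargest` produces.

```python
import heapq

def lf_indexes(u, f):
    # heapq.nlargest scans the n entries once while keeping a heap of
    # size f, which gives the O(n log f) selection cost.
    top = heapq.nlargest(f, range(len(u)), key=lambda i: u[i])
    return sorted(top)                     # indexes that Alice sends to Bob

current = [0.1, 0.7, 0.0, 0.5, 0.2, 0.45]  # term frequencies of Alice's current vector
idx = lf_indexes(current, 3)
print(idx)                                 # indexes of the three largest entries
```

Only the index list leaves Alice's side; the entry values themselves stay local.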

Another point to consider in SSDD-LF is whether its feature selection process is secure. That is, there should be no privacy disclosure when Alice selects the f indexes and shares them with Bob. Fortunately, Alice sends only the indexes $i_j$ to Bob rather than the entry values $u_{i_j}$ of $\vec{U}$, so the sensitive values $u_{i_j}$ are not disclosed in the selection process. Unfortunately, however, the information about which f dimensions are frequent in $\vec{U}$ is revealed to Bob. If the user cannot accept even this limited disclosure of information, SSDD-LF cannot be used as a secure protocol. In this case, we recommend using the previous SSDD-RP, or the following SSDD-GF or SSDD-HF, as a more secure protocol.

4.3 GF: Global Frequency 

SSDD-LF of Section 4.2 has the problem of considering only Alice’s current vector while ignoring all the vectors of Bob. Due to this problem, SSDD-LF achieves the filtering effect for only a part of Bob’s vectors, but not for most of the others. To overcome this problem, in this section we propose another feature selection that uses the whole vector, in which each element represents the number of documents containing the corresponding term. Unlike LF, which focuses on the current vector only, it considers all document vectors and thus has the characteristic of globality. We call this feature selection GF (global frequency) and denote the GF-based secure protocol as SSDD-GF. Actually, GF is the same as DF, which has been widely used as a representative feature selection, and it works as follows. First, let $\vec{A} = \{a_1, \ldots, a_n\}$ be the whole vector, where $a_k$ is the number of documents containing the k-th term, that is, the DF value of the k-th term. Then, to reduce the number of dimensions from n to f, GF simply selects the f dimensions whose DF values are larger than those of the other (n − f) dimensions. We can get the whole vector by scanning all the document vectors once. The traditional DF constructs the whole vector under the assumption that all the document vectors are maintained in a single computer. In SSDD, however, the document vectors are distributed between Alice and Bob, and neither wants to provide its own vectors to the other. Thus, to use GF in SSDD, we first need a secure protocol for constructing the whole vector from the document vectors stored separately by Alice and Bob.


Figure 7 shows Protocol SecureDF, which securely constructs the whole vector $\vec{A}$ from Alice’s and Bob’s document vectors and gets the f most frequent dimensions from $\vec{A}$. In Lines 1 to 8, Alice and Bob compute their own whole vectors independently. That is, Alice computes her own whole vector $\vec{A}^{Alice}$ from her document set $U$, and Bob gets $\vec{A}^{Bob}$ from $V$. In Lines 4 and 8, they share those whole vectors $\vec{A}^{Alice}$ and $\vec{A}^{Bob}$ with each other. In Lines 9 to 11, they then compute the aggregated whole vector $\vec{A}$ from those vectors. After obtaining the whole vector $\vec{A}$, Alice and Bob can select the f most frequent dimensions from $\vec{A}$ in Line 12.

We note that Alice sends $\vec{A}^{Alice}$ to Bob in Line 4, and Bob sends $\vec{A}^{Bob}$ to Alice in Line 8. Vectors $\vec{A}^{Alice}$ and $\vec{A}^{Bob}$, however, are not exact values of document vectors but simple aggregate statistics, and thus we can say that SecureDF does not reveal the private contents of individual documents. The computation and communication complexities of SecureDF are merely $O(n|U| + n|V|)$ and $O(n)$, respectively. Also, SecureDF can be seen as a pre-processing step executed only once for all document vectors of Alice and Bob. Thus, its complexity is negligible compared with the complexity $O(n|U||V|)$ of SSDD-Base.


Protocol SecureDF

(1) $U$ is a set of n-dimensional vectors owned by Alice.
(2) $V$ is a set of n-dimensional vectors owned by Bob.

Alice:
  1. for each dimension k do
  2.   Compute $a_k^{Alice}$ = the number of documents in $U$ containing the k-th term;
  3. end-for
  4. Send $\vec{A}^{Alice}$ to Bob;
Bob:
  5. for each dimension k do
  6.   Compute $a_k^{Bob}$ = the number of documents in $V$ containing the k-th term;
  7. end-for
  8. Send $\vec{A}^{Bob}$ to Alice;
Alice and Bob:
  9. for each dimension k do
  10.   Compute $a_k = a_k^{Alice} + a_k^{Bob}$;  // $a_k$ = DF value of the k-th dimension
  11. end-for
  12. Choose f dimensions, $i_1, \ldots, i_f$, whose DF values in $\vec{A}$ are larger than those of the other dimensions;

Figure 7: Secure protocol for constructing the whole vector.

We now explain SSDD-GF, which exploits SecureDF as the feature selection. Figure 8 shows how we modify Line (1) of Figure 3 to convert SSDD-FS into SSDD-GF. In Line (1-0), we first perform SecureDF to obtain the whole vector $\vec{A}$ and determine the f indexes that are most frequent in $\vec{A}$. For the current n-dimensional vectors $\vec{U}$ and $\vec{V}$, Alice and Bob get the f-dimensional vectors $\vec{U}^{FS}$ and $\vec{V}^{FS}$ by using the determined f indexes. As shown in Figure 8, the current vectors and even their term frequencies are not disclosed to the other party, and thus we can say that SSDD-GF is a secure protocol of SSDD.

(1-0) Obtain $\vec{A}$ through SecureDF and determine $f$ dimensions, $i_1, \ldots, i_f$ ($1 \le i_j \le n$), from $\vec{A}$.

(1-1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors.

(1-2) $\vec{U}^{FS}$ and $\vec{V}^{FS}$ are $f$-dimensional feature vectors extracted from $\vec{U}$ and $\vec{V}$ by using those
$f$ indexes. // $u_j^{FS} = u_{i_j}$, $v_j^{FS} = v_{i_j}$, $j = 1, \ldots, f$.

Figure 8: Modification of Line (1) of SSDD-FS to implement SSDD-GF.


4.4 HF: Hybrid Frequency 

LF and GF, proposed in Sections 4.2 and 4.3, have the following characteristics in terms of the filtering
effect. First, LF considers only Alice's current vector $\vec{U}$, and thus the filtering effect is large only for
the part of Bob's vectors whose TF patterns differ greatly from the current vector, while it is less effective
for most of the other vectors. In other words, LF achieves a better filtering effect than GF when Alice's
current vector differs considerably from the whole vector in TF patterns. Second, GF considers the whole vector
$\vec{A}$ obtained by SecureDF without considering the current vector, and it thus achieves the filtering effect
relatively evenly over many of Bob's document vectors. That is, GF achieves a better filtering effect than
LF when Alice's current vector has characteristics similar to those of the whole vector in TF patterns.


To take advantage of both the locality of LF and the globality of GF, we now propose a hybrid feature selection,
called HF (hybrid frequency). That is, HF uses the current vector to exploit the locality of LF, and at the same
time it also uses the whole vector to exploit the globality of GF. We then present an advanced secure protocol,
SSDD-HF, by applying HF to SSDD-FS. Simply speaking, HF compares the current and whole vectors
and selects the feature dimensions whose differences are larger than those of the other dimensions. In more
detail, we select feature dimensions that have one of the following two characteristics: (1) dimensions
that frequently occur in Alice's current vector but seldom occur in the whole vector (i.e., whose values
are relatively large in the current vector but relatively small in the whole vector); or, on the contrary, (2)
dimensions that seldom occur in Alice's current vector but frequently occur in the whole vector. This is
because the larger $|u_i^{FS} - v_i^{FS}|$ (the difference between the values of a selected feature dimension), the
smaller $upper(\vec{U}^{FS}, \vec{V}^{FS})$, i.e., the larger the $D^2$ term of the equation introduced earlier, which
yields a larger filtering effect.

However, we cannot directly compare Alice's current vector $\vec{U}$ with the whole vector $\vec{A}$ obtained by SecureDF.
The reason is that $\vec{U}$ represents "frequencies of terms" in a single vector, while $\vec{A}$ represents "frequencies
of documents" containing those terms. That is, the meaning of frequencies in $\vec{U}$ differs from that in $\vec{A}$,
and thus their scales are also different. To resolve this problem, before comparing the two vectors $\vec{U}$ and
$\vec{A}$, we first normalize them using their mean ($\mu$) and standard deviation ($\sigma$). More precisely, we
first normalize $\vec{U}$ and $\vec{A}$ to $\hat{U}$ and $\hat{A}$ by Eq. (9), and we next obtain the difference vector
$\vec{D} = \{d_1 = |\hat{u}_1 - \hat{a}_1|, \ldots, d_n = |\hat{u}_n - \hat{a}_n|\}$. After that, we select the $f$ dimensions
with the largest values in $\vec{D}$ and use them as the features of SSDD-HF.

$$\hat{u}_i = \frac{u_i - \mu_U}{\sigma_U}, \qquad \hat{a}_i = \frac{a_i - \mu_A}{\sigma_A} \qquad (9)$$


Figure 9 shows how we modify Line (1) of SSDD-FS in Figure 3 to implement SSDD-HF. First, as
in SSDD-GF, Line (1-0) constructs the whole vector $\vec{A}$ by executing SecureDF. Next, in Lines (1-2) and
(1-3), we normalize the current and whole vectors and obtain the difference vector $\vec{D}$ from those normalized
vectors. Finally, in Lines (1-4) to (1-6), Alice chooses $f$ dimensions from the difference vector $\vec{D}$ and shares
those dimensions with Bob. That is, Lines (1-4) to (1-6) are the same as Lines (1-2) to (1-4) of SSDD-LF
in Figure 6, except that SSDD-LF uses the current vector $\vec{U}$ while SSDD-HF uses the difference vector $\vec{D}$.

(1-0) Obtain $\vec{A}$ through SecureDF.

(1-1) $\vec{U}$ and $\vec{V}$ are n-dimensional document vectors.

(1-2) Alice normalizes $\vec{U}$ and $\vec{A}$ to $\hat{U}$ and $\hat{A}$ by using Eq. (9).

(1-3) Alice obtains the difference vector $\vec{D}$ from $\hat{U}$ and $\hat{A}$ by $d_i = |\hat{u}_i - \hat{a}_i|$.

(1-4) Alice chooses $f$ dimensions, $i_1, \ldots, i_f$, whose $d_i$ values are larger than those of the other dimensions.

(1-5) Alice sends those $f$ indexes to Bob. // This can be done together with Line 3 of Figure 2.

(1-6) $\vec{U}^{FS}$ and $\vec{V}^{FS}$ are $f$-dimensional feature vectors extracted from $\vec{U}$ and $\vec{V}$ by using those
$f$ indexes. // $u_j^{FS} = u_{i_j}$, $v_j^{FS} = v_{i_j}$, $j = 1, \ldots, f$.

Figure 9: Modification of Line (1) of SSDD-FS to implement SSDD-HF.
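Lines (1-2) to (1-6) of Figure 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`z_normalize`, `hf_select`, `extract`) are hypothetical, the normalization is assumed to be the usual z-score with the population standard deviation, and ties are broken by the lower index.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_normalize(x: Sequence[float]) -> List[float]:
    """Line (1-2): subtract the mean and divide by the standard deviation.

    Assumption: population standard deviation; a constant vector maps to zeros.
    """
    mu, sigma = mean(x), pstdev(x)
    if sigma == 0:
        return [0.0] * len(x)
    return [(v - mu) / sigma for v in x]


def hf_select(u: Sequence[float], a: Sequence[float], f: int) -> List[int]:
    """Lines (1-3) and (1-4): indexes of the f largest |u_hat_i - a_hat_i|."""
    u_hat, a_hat = z_normalize(u), z_normalize(a)
    d = [abs(p - q) for p, q in zip(u_hat, a_hat)]
    return sorted(range(len(d)), key=lambda i: (-d[i], i))[:f]


def extract(vec: Sequence[float], idx: Sequence[int]) -> List[float]:
    """Line (1-6): project a document vector onto the selected indexes."""
    return [vec[i] for i in idx]


# Toy data (hypothetical): TF values of Alice's current vector and DF values
# of the whole vector, on deliberately different scales.
u = [9, 0, 1, 0]
a = [1, 8, 3, 2]

idx = hf_select(u, a, 2)   # Alice would send these indexes to Bob, Line (1-5)
u_fs = extract(u, idx)
```

In this toy input, dimension 0 is frequent in the current vector but rare in the whole vector, and dimension 1 is the opposite, so both kinds of dimension described above are selected.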

The overhead of feature selection in SSDD-HF can be seen as the sum of those in SSDD-LF
and SSDD-GF. That is, like SSDD-GF, it has the overhead of performing SecureDF to obtain the whole
vector $\vec{A}$, and at the same time, like SSDD-LF, it has the overhead of choosing the largest $f$ dimensions
from the n-dimensional difference vector $\vec{D}$. These overheads, however, are negligible for the following
reasons: (1) as we explained for SSDD-GF in Section 4.3, SecureDF, with computation and communication
complexities of $O(n|U| + n|V|)$ and $O(n)$, can be seen as a pre-processing step executed only once for
all document vectors, and its overhead is negligible in the whole process of SSDD; (2) as we explained
for SSDD-LF in Section 4.2, the computation complexity $O(n \log f)$ of choosing $f$ dimensions from an
n-dimensional vector can be ignored since it can also be seen as a pre-processing step. One more notable
point is that SSDD-HF is a secure protocol like SSDD-GF, since it uses SecureDF and the difference vector,
which are secure and do not disclose any original values or any sensitive indexes of individual vectors.
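One common way to reach the $O(n \log f)$ bound for choosing the $f$ largest dimensions is a bounded min-heap. The sketch below assumes this technique for illustration; the paper does not state which selection algorithm its implementation uses, and the function name `top_f_indexes` is hypothetical.

```python
import heapq
from typing import List, Sequence


def top_f_indexes(values: Sequence[float], f: int) -> List[int]:
    """Return the indexes of the f largest values using a heap of size <= f.

    Each of the n values costs at most one O(log f) heap operation,
    giving O(n log f) overall.
    """
    if f <= 0:
        return []
    heap = []  # min-heap of (value, index); the smallest kept value is heap[0]
    for i, v in enumerate(values):
        if len(heap) < f:
            heapq.heappush(heap, (v, i))
        elif v > heap[0][0]:
            # Evict the smallest kept value in favor of the larger one.
            heapq.heapreplace(heap, (v, i))
    return sorted(i for _, i in heap)


picked = top_f_indexes([0.2, 2.6, 0.1, 2.3, 1.0], 2)
```

Because $f$ is typically a small fraction of $n$ (around 1% in the experiments below), the heap stays small compared with a full $O(n \log n)$ sort.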



5 Performance Evaluation 


5.1 Experimental Data and Environment 

In this section, we empirically evaluate the feature selection-based SSDD protocols proposed in Section 4. As
the experimental data, we use three datasets obtained from the document sets of the UCI repository [29]. These
datasets are KOS blog entries, NIPS full papers, and Enron emails, which have been frequently used in
text mining. The first dataset consists of KOS blog entries collected from dailykos.com, and we call it
KOS. KOS consists of 3,430 documents with 6,906 different terms (dimensions), and it has a total of 467,714
terms. The second dataset contains NIPS full papers published at the Neural Information Processing Systems
Conference, and we call it NIPS. NIPS consists of 1,500 documents with 12,419 different terms, and it has
about 1.9 million terms in total. The third dataset contains e-mail messages of Enron, and we call it EMAILS.
EMAILS consists of 39,861 e-mails with 28,102 different terms, and it has about 6.4 million terms in total.

We experiment with five SSDD protocols: SSDD-Base as the basic one and the four proposed ones, SSDD-RP,
SSDD-LF, SSDD-GF, and SSDD-HF. In the experiments, we measure the elapsed time of
executing SSDD for each protocol. In the first experiment, we vary the number of dimensions for a fixed
tolerance, where the number of dimensions means $f$, i.e., the number of features (dimensions) selected by
the feature selection. In the second experiment, we vary the tolerance for a fixed number of dimensions.
For these two experiments, we use KOS and NIPS, which have a relatively small number of documents
compared with EMAILS. The third experiment, on the other hand, tests the scalability of each protocol, and
we thus use EMAILS, whose number of documents is much larger than those of KOS and NIPS.

The hardware platform is an HP ProLiant ML110 G7 workstation equipped with an Intel(R) Xeon(R) Quad
Core CPU E31220 3.10GHz, 16GB RAM, and a 250GB HDD; its software platform is CentOS 6.5 Linux. We
use the C language for implementing all the protocols. We perform SSDD on a single machine using a local loop
for network communication. We use the local loop because we want to intentionally ignore
the network speed, since different network speeds or environments may largely distort the actual execution
time of each protocol. We measure the execution time taken for Alice to send each document to Bob
and securely identify its similarity. More precisely, we store the whole dataset at Bob and select ten query
documents for Alice. After that, we execute each SSDD protocol for those ten query documents and use
the sum of their execution times as the experimental result.


5.2 Experimental Results 


Figure 10 shows the experimental results for KOS. First, in Figure 10(a), we set the tolerance to 0.80 and
vary the number of dimensions over 70, 210, 350, 490, and 640, which correspond to 1%, 3%, 5%, 7%, and
9% of the KOS dimensions. In the figure, the x-axis shows the number of (selected) dimensions, and the
y-axis shows the actual execution time. Note that the y-axis is in log scale.


(a) Different numbers of dimensions (tolerance = 0.80). (b) Different tolerances (number of dimensions = 70).
Legend: SSDD-Base, SSDD-RP, SSDD-LF, SSDD-GF, SSDD-HF.

Figure 10: Experimental results for KOS.


Figure 10(a) shows that all proposed protocols significantly outperform the basic SSDD-Base. Even
SSDD-RP, which selects features randomly, beats SSDD-Base by exploiting the filtering effect in the first
step of the 2-step protocol. Next, SSDD-GF shows better performance than SSDD-RP since it selects
the features that occur frequently throughout the whole dataset by using DF. In the case of SSDD-RP and
SSDD-GF, we note that the execution time decreases as the number of dimensions increases. This is because
the more dimensions we use, the larger the filtering effect we can exploit. SSDD-LF, which uses the
locality of the current vector, also outperforms SSDD-RP as well as SSDD-Base. In particular, SSDD-LF
is better than SSDD-GF for a small number of dimensions, but it is worse than SSDD-GF for a large number
of dimensions. This is because only a small number of dimensions have a big influence on the locality
of the current vector. Finally, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF, shows
the best performance for all numbers of dimensions. In Figure 10(a), we note that the execution time of SSDD-LF
and SSDD-HF slightly increases as the number of dimensions increases. The reason is that, as the number
$f$ of dimensions increases, the filtering effect increases relatively slowly, but the overhead of obtaining a
current/difference vector and choosing $f$ dimensions from that vector increases relatively quickly.

Second, in Figure 10(b), we set the number of dimensions to 70 (1% of total dimensions) and vary the
tolerance from 0.95 down to 0.75 in steps of 0.05. Note that the closer the tolerance is to 1.0, the stronger
the similarity we require. As shown in the figure, all proposed protocols significantly improve the performance
compared with SSDD-Base. In particular, SSDD-LF and SSDD-HF, which exploit the locality, show
better performance than the other two proposed ones. We note here that, as the tolerance decreases, the
execution times of all proposed protocols gradually increase. This is because the smaller the tolerance we use,
the more documents we get as similar ones. That is, as the tolerance decreases, more documents pass the
first step, and thus more time is spent in the second step. In summary of Figure 10, the proposed SSDD-LF
and SSDD-HF significantly outperform SSDD-Base by up to 726.6 and 9858 times, respectively.

Figure 11 shows the experimental results for NIPS. As in Figure 10 for KOS, we measure the execution
time of SSDD by varying the number of dimensions and the tolerance. In Figure 11(a), we set the tolerance to
0.80 and increase the number of dimensions from 120 (1%) to 600 (5%) in steps of 120 (1%), where 120 is about
1% of the 12,419 dimensions in total. Next, in Figure 11(b), we set the number of dimensions to 120 and decrease
the tolerance from 0.95 to 0.75 in steps of 0.05. The experimental results of Figures 11(a) and 11(b) show a
trend very similar to those of Figures 10(a) and 10(b). That is, all proposed protocols significantly outperform
SSDD-Base, and SSDD-HF shows the best performance. In Figure 11, SSDD-HF improves
the performance over SSDD-Base by up to 16620 times.

Figure 11: Experimental results for NIPS.

Figure 12 shows the results of the scalability test using a large, high-dimensional dataset,
EMAILS. We set the tolerance and the number of dimensions to 0.80 and 70, respectively, and we increase
the number of documents (emails) from 40 (0.1%) to 39,861 (100%) by factors of 10. In this experiment,
we exclude the results of SSDD-Base, SSDD-RP, and SSDD-GF for the case of 39,861 documents due
to excessive execution time. As shown in the figure, like the results for KOS and NIPS, our feature
selection-based protocols outperform SSDD-Base in all cases, and in particular, SSDD-LF and SSDD-HF show the
best performance regardless of the number of documents. We also note that all proposed protocols show a
pseudo-linear trend in the number of documents. (Please note that both the x- and y-axes are in log scale.) That is,
the protocols are pseudo-linear solutions in the number of documents, and we can say that they are excellent
in scalability as well as performance.
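The reading of a "pseudo-linear" trend from a plot with both axes in log scale can be made explicit. Assuming, purely for illustration, that the execution time $T$ follows a power law in the number of documents $N$ (the symbols $c$ and $\alpha$ are hypothetical and are not fitted to the measurements above):

```latex
T(N) = c\,N^{\alpha}
\quad\Longrightarrow\quad
\log T(N) = \log c + \alpha \log N
```

On log-log axes this is a straight line with slope $\alpha$, so a slope close to 1 corresponds to an execution time that grows roughly in proportion to the number of documents.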


Figure 12: Experimental results of scalability test for EMAILS.


6 Conclusions

In this paper, we addressed an efficient method of significantly reducing computation and communication
overhead in secure similar document detection. The contributions of the paper can be summarized as follows.
First, we thoroughly analyzed the previous 1-step protocol and pointed out that it incurs serious performance
overhead for high dimensional document vectors. Second, to alleviate the overhead, we presented the
feature selection-based 2-step protocol and formally proved its correctness. Third, to improve the filtering
efficiency of the 2-step protocol, we proposed four feature selections: (1) RP, which selects features randomly,
(2) LF, which exploits the locality of a current vector, (3) GF, which exploits the globality of all document
vectors, and (4) HF, which considers both locality and globality. Fourth, for each feature selection, we presented
its formal protocol and analyzed its security and overhead. Fifth, through experiments on three real datasets, we
showed that all proposed protocols significantly outperformed the base protocol, and in particular, the HF-based
secure protocol improved performance by up to three or four orders of magnitude. As future work,
we will consider two issues: (1) use of feature extraction (feature creation) instead of feature selection for
dimensionality reduction and (2) use of homomorphic encryption rather than a random matrix for the secure
scalar product.